AICE1006 - Data Analytics¶

Lecture 7 - Data Plotting (Advanced)¶

Interactive data visualization with plotly

Zhiwu Huang
Lecturer (Assistant Professor)
Vision, Learning and Control (VLC) Research Group
School of Electronics and Computer Science (ECS)
University of Southampton

Office hour: Wed 2PM-3PM, please book in advance.
Zhiwu.Huang@soton.ac.uk



Credit: Marco Forgione, Researcher, USI-SUPSI

Plotly in a nutshell¶

Plotly is a modern plotting library for Python, R, MATLAB, Julia, etc.

For Python, the reference documentation is available at https://plotly.com/python/

Plotly vs matplotlib¶

You can build high-quality visualizations with good old matplotlib. However,

  • A lot of low-level code is required
  • The visualizations are generally static

Plotly is a modern and powerful alternative. It provides:

  • Concise high-level syntax for common data visualization
  • Tight integration with pandas
  • Interactive plots

Other alternatives exist, for instance seaborn:

  • Also concise and high-level
  • Also integrated with pandas
  • Not interactive

Plotly Express¶

The Plotly Express sub-module of plotly provides a high-level API for common visualizations and covers many use cases.

In [1]:
import plotly.express as px 
# import plotly # contains more advanced low-level functionalities for custom visualizations

Plotly Express provides functions to load well-known datasets. Let us load the iris dataset.

In [2]:
df_iris = px.data.iris() # several classic dataframes are included in plotly for demonstration purpose
df_iris.sample(5)
Out[2]:
sepal_length sepal_width petal_length petal_width species species_id
131 7.9 3.8 6.4 2.0 virginica 3
76 6.8 2.8 4.8 1.4 versicolor 2
56 6.3 3.3 4.7 1.6 versicolor 2
21 5.1 3.7 1.5 0.4 setosa 1
88 5.6 3.0 4.1 1.3 versicolor 2

Scatterplot¶

A scatterplot is the most common visualization for 2 numeric variables

In [3]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", width=1600, height=800) # specify dataframe and columns for x/y
fig.update_layout(font_size=20);
fig.show()
  • Syntax: px.scatter(df_iris, x="petal_width", y="petal_length", ...)
  • Axes labels automatically set to the column names
  • Interactive!

Scatterplot cont'd¶

The marker color is commonly used as another dimension of visual analysis

In [4]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", width=1600, height=800) # color the markers by species
fig.update_layout(font_size=20);
fig.show()
  • Implemented with color="species"
  • Legend automatically added

Scatterplot cont'd¶

The marker size provides yet another dimension of visual analysis

In [5]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", size="petal_width", width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()
  • Implemented with size="petal_width"

Scatterplot cont'd¶

The interactive text displayed when hovering over a point may also be modified

In [6]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", 
                 size="sepal_width", hover_data=["sepal_length"], width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()
  • Implemented with hover_data=["sepal_length"]

Scatterplot matrix¶

The scatterplot matrix is a useful visualization for several numeric variables. It collects the scatterplots of all pairwise combinations of the variables.

In [7]:
fig = px.scatter_matrix(df_iris, dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"], 
                        color="species", width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()
  • Implemented with px.scatter_matrix(...)
  • The variables to be analyzed correspond to the dimensions argument

Histograms & Box plots¶

Histograms & box plots may be used to represent the distribution of a single numerical variable

In [8]:
fig = px.histogram(df_iris, x="sepal_width", width=800, height=400)
fig.update_layout(font_size=20);
fig.show()
In [9]:
fig = px.box(df_iris, x="sepal_width", width=800, height=400)
fig.update_layout(font_size=20);
fig.show() 

Multiple box plots¶

Multiple box plots may be constructed specifying a categorical variable for y...

In [10]:
fig = px.box(df_iris, x="sepal_width", y="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

... or for color

In [11]:
fig = px.box(df_iris, x="sepal_width", color="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

Multiple box plots cont'd¶

Note: the role of x and y may be interchanged

In [12]:
fig = px.box(df_iris, y="sepal_width", x="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 
In [13]:
fig = px.box(df_iris, y="sepal_width", color="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

Bar plot¶

Bar plots are commonly used to represent a numeric variable vs. a categorical one. Example: aggregated group statistics

In [14]:
df_iris_mean = df_iris.groupby("species", as_index=False).mean()
df_iris_mean
Out[14]:
species sepal_length sepal_width petal_length petal_width species_id
0 setosa 5.006 3.418 1.464 0.244 1
1 versicolor 5.936 2.770 4.260 1.326 2
2 virginica 6.588 2.974 5.552 2.026 3
In [15]:
fig = px.bar(df_iris_mean, x="species", y="petal_length", title="Average petal_length, by species"); fig.update_layout(font_size=20); fig.show() 

Bar plot¶

Another example where a bar plot looks nice: data for different years

In [16]:
import plotly.express as px
data_canada_it = px.data.gapminder().query("country == 'Canada' or country == 'Italy'")
fig = px.bar(data_canada_it, x='year', y='pop', color="country", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 
#data_canada_it

Bar plot¶

With barmode="group", the bars of the two countries are placed side by side instead of being stacked

In [17]:
import plotly.express as px
data_canada_it = px.data.gapminder().query("country == 'Canada' or country == 'Italy'")
fig = px.bar(data_canada_it, x='year', y='pop', color="country", barmode="group", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

The Italian population has been stable since the 1980s, while the Canadian population is still increasing.

Pie charts¶

Pie charts give an intuitive representation of percentages.

In [18]:
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 5.e6, 'country'] = 'Other countries' # Represent only large countries
df.sample(3)
Out[18]:
country continent year lifeExp pop gdpPercap iso_alpha iso_num
527 Finland Europe 2007 79.313 5238460 33207.08440 FIN 246
683 Hungary Europe 2007 73.338 9956108 18008.94444 HUN 348
779 Italy Europe 2007 80.546 58147733 28569.71970 ITA 380
In [19]:
fig = px.pie(df, values='pop', names='country', title='Population of European continent', width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

Faceting¶

Faceting allows dealing with up to two categorical variables by repeating the same base plot on different rows/columns.

Back to the tips dataset:

In [20]:
df_tip = px.data.tips()
df_tip.head(5)
Out[20]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
  • total_bill and tip are numeric quantities
  • day and time are (ordered) categorical variables
  • sex and smoker are categorical variables with unspecified order

Simple scatterplot¶

How does the relation between tip and total_bill vary across the different days? We may use a scatterplot of tip vs total_bill, colored by day.

In [31]:
fig = px.scatter(df_tip, x="total_bill", y="tip", color="day", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

The result is not very clear...

Faceted Scatterplots¶

Facet columns may be used instead: a separate plot is generated for each day

In [32]:
fig = px.scatter(df_tip, x="total_bill", y="tip", facet_col="day", category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]}, width= 1800, height=500)
fig.update_layout(font_size=20); fig.show() 
  • facet_col="day": repeat the scatterplot for the different values of the categorical variable day on columns
  • The category_orders dictionary specifies the order to be used for the categorical variables

Faceted Scatterplots cont'd¶

Using facet rows and columns we may handle 2 categorical variables

In [33]:
fig = px.scatter(df_tip, x="total_bill", y="tip", facet_col="day", facet_row="time", 
                 category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]},
                 width= 1600, height=700)
fig.update_layout(font_size=20); fig.show() 
  • facet_col="day": day on columns
  • facet_row="time": time on rows

Faceted Histograms¶

Histograms may also be modified with faceting

In [34]:
fig = px.histogram(df_tip, x="total_bill", facet_col="day", facet_row="smoker", color="sex",
                   category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
                   width=1600, height=800)
fig.update_layout(font_size=20); fig.show() 
  • 1 categorical variable (sex) handled with the color option
  • 2 categorical variables (day/smoker) handled with rows/columns

Faceted Boxplot¶

In [35]:
fig = px.box(df_tip, x="day", y="total_bill",
             facet_col="smoker",
             category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]},
             color="day",
             width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

Animation: time as an extra dimension¶

In the following scatterplot, we visualize 4 properties for different countries in 2007:

  • gdpPercap (x position)
  • lifeExp (y position)
  • continent (marker color)
  • population (marker size)
In [36]:
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp", size="pop", color="continent", hover_name="country", log_x=True,
           title="GDP, life expectancy, continent, and population of countries in 2007", size_max=60, width=1400, height=600)
fig.update_layout(font_size=20); fig.show() 

What if we want to see the evolution over time? An animation could be used!

Animation: time as an extra dimension¶

In [37]:
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", 
           size="pop", color="continent", hover_name="country", 
           log_x=True, size_max=45, range_x=[100,100000], range_y=[25,90],
           width=1400, height=600)
fig.update_layout(font_size=20); fig.show() 

Animation time is well suited to representing the year dimension!

Maps¶

Maps are the natural representation of geographical data. They are similar to scatterplots, with longitude and latitude playing the role of x and y.

In [38]:
import pandas as pd
# COVID-19 Italian data downloaded from https://github.com/pcm-dpc/COVID-19/blob/master/dati-regioni/dpc-covid19-ita-regioni.csv on 27-08-2020
data_latest = pd.read_csv("dpc-covid19-ita-regioni.csv") 
In [39]:
center = {"lat": 43.1, "lon": 12.3} # coordinates of central Italy (Perugia)
fig = px.scatter_mapbox(data_latest, lon="long", lat="lat",
                        center=center,
                        size="totale_casi", # total cases
                        hover_data= ["denominazione_regione"], # region name
                        zoom=4)
fig.update_traces(textposition='top center')
fig.update_layout(
    width=800,
    height=800,
    title_text='Italian COVID-19 total cases, updated on 27-08-2020',
    #center=center
)
fig.update_layout(mapbox_style="carto-darkmatter") # warning! some styles require an account 
fig.show()

Maps¶

Maps can also be animated, like many other Plotly Express visualizations.

In [40]:
center = {"lat": 43.1, "lon": 12.3}
fig = px.scatter_mapbox(data_latest, lon="long", lat="lat", # longitude, latitude
                        center=center,
                        size="totale_casi", # total cases
                        hover_data= ["denominazione_regione"], # region name
                        animation_frame="data", # date
                        zoom=4)

fig.update_traces(textposition='top center')
fig.update_layout(
    width=800,
    height=800,
    title_text='Cases-Regions',
)
fig.update_layout(mapbox_style="carto-darkmatter") # warning! some styles require an account 
fig.show()